{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extract Datasets and Establish Benchmark\n",
    "\n",
    "**Learning Objectives**\n",
    "- Divide into Train, Evaluation and Test datasets\n",
    "- Understand why we need each\n",
    "- Pull data out of BigQuery and into CSV\n",
    "- Establish Rules Based Benchmark\n",
    "\n",
    "## Introduction \n",
    "In the previous notebook we demonstrated how to do ML in BigQuery. However BQML is limited to linear models.\n",
    "\n",
    "For advanced ML we need to pull the data out of BigQuery and load it into a ML Framework, in our case TensorFlow.\n",
    "\n",
    "While TensorFlow [can read from BigQuery directly](https://www.tensorflow.org/api_docs/python/tf/contrib/cloud/BigQueryReader), the performance is slow. The best practice is to first stage the BigQuery files as .csv files, and then read the .csv files into TensorFlow. \n",
    "\n",
    "The .csv files can reside on local disk if we're training locally, but if we're training in the cloud we'll need to move the .csv files to the cloud, in our case Google Cloud Storage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up environment variables and load necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip freeze | grep google-cloud-bigquery==1.21.0 || pip install google-cloud-bigquery==1.21.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After installation, restart the kernel by selecting **Kernel** > **Restart Kernel**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT = \"cloud-training-demos\"  # Replace with your PROJECT\n",
    "REGION = \"us-central1\"            # Choose an available region for Cloud MLE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"PROJECT\"] = PROJECT\n",
    "os.environ[\"REGION\"] = REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext google.cloud.bigquery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Review\n",
    "\n",
    "In the [a_sample_explore_clean](a_sample_explore_clean.ipynb) notebook we came up with the following query to extract a repeatable and clean sample: \n",
    "<pre>\n",
    "#standardSQL\n",
    "SELECT\n",
    "  (tolls_amount + fare_amount) AS fare_amount, -- label\n",
    "  pickup_datetime,\n",
    "  pickup_longitude, \n",
    "  pickup_latitude, \n",
    "  dropoff_longitude, \n",
    "  dropoff_latitude\n",
    "FROM\n",
    "  `nyc-tlc.yellow.trips`\n",
    "WHERE\n",
    "  -- Clean Data\n",
    "  trip_distance > 0\n",
    "  AND passenger_count > 0\n",
    "  AND fare_amount >= 2.5\n",
    "  AND pickup_longitude > -78\n",
    "  AND pickup_longitude < -70\n",
    "  AND dropoff_longitude > -78\n",
    "  AND dropoff_longitude < -70\n",
    "  AND pickup_latitude > 37\n",
    "  AND pickup_latitude < 45\n",
    "  AND dropoff_latitude > 37\n",
    "  AND dropoff_latitude < 45\n",
    "  -- repeatable 1/5000th sample\n",
    "  AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1\n",
    "  </pre>\n",
    "  \n",
    "We will use the same query **with one change**. Instead of using `pickup_datetime` as is, we will extract `dayofweek` and `hourofday` from it. This is to give us some categorical features in our dataset so we can illustrate how to deal with them when we get to feature engineering. The new query will be:\n",
    "\n",
    "<pre>\n",
    "SELECT\n",
    "  (tolls_amount + fare_amount) AS fare_amount, -- label\n",
    "  EXTRACT(DAYOFWEEK from pickup_datetime) AS dayofweek,\n",
    "  EXTRACT(HOUR from pickup_datetime) AS hourofday,\n",
    "  pickup_longitude, \n",
    "  pickup_latitude, \n",
    "  dropoff_longitude, \n",
    "  dropoff_latitude\n",
    "-- rest same as before\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Split into train, evaluation, and test sets\n",
    "\n",
    "For ML modeling we need not just one, but three datasets.\n",
    "\n",
    "**Train:** This is what our model learns on\n",
    "\n",
    "**Evaluation (aka Validation):** We shouldn't evaluate our model on the same data we trained on because then we couldn't know whether it was memorizing the input data or whether it was generalizing. Therefore we evaluate on the evaluation dataset, aka validation dataset.\n",
    "\n",
    "**Test:** We use our evaluation dataset to tune our hyperparameters (we'll cover hyperparameter tuning in a future lesson). We need to know that our chosen set of hyperparameters will work well for data we haven't seen before because in production, that will be the case. For this reason, we create a third dataset that we never use during the model development process. We only evaluate on this once our model development is finished. Data scientists don't always create a test dataset (aka holdout dataset), but to be thorough you should.\n",
    "\n",
    "We can divide our existing 1/5000th sample three ways 70%/15%/15%  (or whatever split we like) with some modulo math demonstrated below.\n",
    "\n",
    "Because we are using a hash function these results are deterministic, we'll get the same exact split every time the query is run (assuming the underlying data hasn't changed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Exercise 1**\n",
    "\n",
    "The `create_query` function below returns a query string that we will pass to BigQuery to collect our data. It takes as arguments the phase (`TRAIN`, `VALID`, or `TEST`) and the sample_size (relating to the fraction of the data we wish to sample). Complete the code below so that when the phase is set as `VALID` or `TEST` a new 15% split of the data will be created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_query(phase, sample_size):\n",
    "    basequery = \"\"\"\n",
    "    SELECT\n",
    "        (tolls_amount + fare_amount) AS fare_amount,\n",
    "        EXTRACT(DAYOFWEEK from pickup_datetime) AS dayofweek,\n",
    "        EXTRACT(HOUR from pickup_datetime) AS hourofday,\n",
    "        pickup_longitude AS pickuplon,\n",
    "        pickup_latitude AS pickuplat,\n",
    "        dropoff_longitude AS dropofflon,\n",
    "        dropoff_latitude AS dropofflat\n",
    "    FROM\n",
    "        `nyc-tlc.yellow.trips`\n",
    "    WHERE\n",
    "        trip_distance > 0\n",
    "        AND fare_amount >= 2.5\n",
    "        AND pickup_longitude > -78\n",
    "        AND pickup_longitude < -70\n",
    "        AND dropoff_longitude > -78\n",
    "        AND dropoff_longitude < -70\n",
    "        AND pickup_latitude > 37\n",
    "        AND pickup_latitude < 45\n",
    "        AND dropoff_latitude > 37\n",
    "        AND dropoff_latitude < 45\n",
    "        AND passenger_count > 0\n",
    "        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N)) = 1\n",
    "    \"\"\"\n",
    "\n",
    "    if phase == \"TRAIN\":\n",
    "        subsample = \"\"\"\n",
    "        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) >= (EVERY_N * 0)\n",
    "        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) <  (EVERY_N * 70)\n",
    "        \"\"\"\n",
    "    elif phase == \"VALID\":\n",
    "        subsample = \"\"\"\n",
    "        # TODO: Your code goes here\n",
    "        \"\"\"\n",
    "    elif phase == \"TEST\":\n",
    "        subsample = \"\"\"\n",
    "        # TODO: Your code goes here\n",
    "        \"\"\"\n",
    "\n",
    "    query = basequery + subsample\n",
    "    return query.replace(\"EVERY_N\", sample_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Write to CSV\n",
    "Now let's execute a query for train/valid/test and write the results to disk in csv format. We use Pandas's `.to_csv()` method to do so."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Exercise 2**\n",
    "\n",
    "The `for` loop below will generate the TRAIN/VALID/TEST sampled subsets of our dataset. Complete the code in the cell below to 1) create the BigQuery `query_string` using the `create_query` function you completed above, taking our original 1/5000th of the dataset and 2) load the BigQuery results of that `query_string` to a DataFrame labeled `df`. \n",
    "\n",
    "The remaining lines of code write that DataFrame to a csv file with the appropriate naming."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud import bigquery\n",
    "bq = bigquery.Client(project=PROJECT)\n",
    "\n",
    "for phase in [\"TRAIN\", \"VALID\", \"TEST\"]:\n",
    "    # 1. Create query string\n",
    "    query_string = # TODO: Your code goes here\n",
    "\n",
    "    # 2. Load results into DataFrame\n",
    "    df = # TODO: Your code goes here\n",
    "\n",
    "    # 3. Write DataFrame to CSV\n",
    "    df.to_csv(\"taxi-{}.csv\".format(phase.lower()), index_label = False, index = False)\n",
    "    print(\"Wrote {} lines to {}\".format(len(df), \"taxi-{}.csv\".format(phase.lower())))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that even with a 1/5000th sample we have a good amount of data for ML. 150K training examples and 30K validation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3> Verify that datasets exist </h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls -l *.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preview one of the files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head taxi-train.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks good! We now have our ML datasets and are ready to train ML models, validate them and test them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Establish rules-based benchmark\n",
    "\n",
    "Before we start building complex ML models, it is a good idea to come up with a simple rules based model and use that as a benchmark. After all, there's no point using ML if it can't beat the traditional rules based approach!\n",
    "\n",
    "Our rule is going to be to divide the mean fare_amount by the mean estimated distance to come up with a rate and use that to predict. \n",
    "\n",
    "Recall we can't use the actual `trip_distance` because we won't have that available at prediction time (depends on the route taken), however we do know the users pick up and drop off location so we can use euclidean distance between those coordinates."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Exercise 3**\n",
    "\n",
    "In the code below, we create a rules-based benchmark and measure the Root Mean Squared Error against the label. The function `euclidean_distance` takes as input a Pandas dataframe and should measure the straight line distance between the pickup location and the dropoff location. Complete the code so that the function returns Euclidean distance between the pickup and dropoff location. \n",
    "\n",
    "The `compute_rmse` funciton takes the actual (label) value and the predicted value and computes the Root Mean Squared Error between the the two. Complete the code below for the `compute_rmse` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "def euclidean_distance(df):\n",
    "    return # TODO: Your code goes here\n",
    "\n",
    "def compute_rmse(actual, predicted):\n",
    "    return # TODO: Your code goes here\n",
    "\n",
    "def print_rmse(df, rate, name):\n",
    "    print(\"{} RMSE = {}\".format(compute_rmse(df[\"fare_amount\"], rate * euclidean_distance(df)), name))\n",
    "\n",
    "df_train = pd.read_csv(\"taxi-train.csv\")\n",
    "df_valid = pd.read_csv(\"taxi-valid.csv\")\n",
    "\n",
    "rate = df_train[\"fare_amount\"].mean() / euclidean_distance(df_train).mean()\n",
    "\n",
    "print_rmse(df_train, rate, \"Train\")\n",
    "print_rmse(df_valid, rate, \"Valid\") "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The simple distance-based rule gives us an RMSE of <b>$7.70</b> on the validation dataset.  We have to beat this, of course, but you will find that simple rules of thumb like this can be surprisingly difficult to beat. \n",
    "\n",
    "You don't want to set a goal on the test dataset because you'll want to tweak your hyperparameters and model architecture to get the best validation error. Then, you can evaluate ONCE on the test data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Challenge exercise\n",
    "\n",
    "Let's say that you want to predict whether a Stackoverflow question will be acceptably answered. Using this [public dataset of questions](https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow?filter=solution-type:dataset&q=stack%20overflow), create a machine learning dataset that you can use for classification.\n",
    "<p>\n",
    "What is a reasonable benchmark for this problem?\n",
    "What features might be useful?\n",
    "<p>\n",
    "If you got the above easily, try this harder problem: you want to predict whether a question will be acceptably answered within 2 days. How would you create the dataset?\n",
    "<p>\n",
    "Hint (highlight to see):\n",
    "<p style='color:white' linkstyle='color:white'> \n",
    "You will need to do a SQL join with the table of [answers]( https://bigquery.cloud.google.com/table/bigquery-public-data:stackoverflow.posts_answers) to determine whether the answer was within 2 days.\n",
    "</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2022 Google Inc.\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
    "http://www.apache.org/licenses/LICENSE-2.0\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
